On-line relational SOM for dissimilarity data 



Madalina Olteanu, Nathalie Villa-Vialaneix, and Marie Cottrell 

SAMM - Université Paris 1 Panthéon-Sorbonne 
90, rue de Tolbiac, 75013 Paris - France 
{madalina.olteanu, nathalie.villa, marie.cottrell}@univ-paris1.fr 
http://samm.univ-paris1.fr 



Abstract. In some applications and in order to address real world sit- 
uations better, data may be more complex than simple vectors. In some 
examples, they can be known through their pairwise dissimilarities only. 
Several variants of the Self Organizing Map algorithm were introduced 
to generalize the original algorithm to this framework. Whereas median 
SOM is based on a rough representation of the prototypes, relational 
SOM allows representing these prototypes by a virtual combination of 
all elements in the data set. However, this latter approach suffers from 
two main drawbacks. First, its complexity can be large. Second, only 
a batch version of this algorithm has been studied so far and it often 
provides results having a bad topographic organization. In this article, 
an on-line version of relational SOM is described and justified. The algo- 
rithm is tested on several datasets, including categorical data and graphs, 
and compared with the batch version and with other SOM algorithms 
for non vector data. 



1 Introduction 

In many real-world applications, data cannot be described by a fixed set of 
numerical attributes. This is the case, for instance, when data are described by 
categorical variables or by relations between objects (e.g., persons involved in 
a social network). A common solution to address this kind of issue is to use 
a measure of resemblance (i.e., a similarity or a dissimilarity) that can handle 
categorical variables or graphs, or that focuses on specific aspects of the data 
and is designed using expert knowledge. Many standard methods for data mining 
have been generalized to non vectorial data, recently including prototype-based 
clustering. The recent paper [6] provides an overview of several methods that 
have been proposed to tackle complex data with neural networks. 

In particular, several extensions of the Self-Organizing Maps (SOM) algorithm 
have been proposed. One approach consists in extending SOM to categorical data 
by using a method similar to Multiple Correspondence Analysis [5]. Another 
approach uses the median principle, which consists in replacing the standard 
computation of the prototypes by an approximation in the original dataset. 
This principle was used to extend SOM to dissimilarity data in [15]. One of the 
main drawbacks of this approach is that forcing the prototypes to be chosen 
among the dataset is very restrictive; in order to increase the flexibility of 
the representation, [3] propose to represent a class by several prototypes, all 
chosen among the original dataset. However, this method increases the 
computational time, and prototypes still stay restricted to the original 
dataset, hence reflecting possible sampling or sparsity issues. 

An alternative to median-based algorithms relies on a method that is close 
to the classical algorithm used in the Euclidean case and is based on the idea 
that prototypes may be expressed as linear combinations of the original dataset. 
In the kernel SOM framework, this setting is made natural by the use of the 
kernel that maps the original data into a (high dimensional) Euclidean space 
(see [16,1] for on-line versions and [2] for the batch version). Many kernels 
have been designed to handle complex data such as strings, nodes in a graph or 
graphs themselves [10]. 

More generally, when the data are already described by a dissimilarity that 
is not associated with a kernel, [12,18,11] use a similar idea. They introduce 
an implicit "convex combination" of the original data to extend the classical 
batch versions of SOM to dissimilarity data. This approach is known under the 
name "relational SOM". The purpose of the present paper is to show that the 
same idea can be used to define on-line relational SOM. Such an approach 
reduces the computational cost of the algorithm and leads to a better 
organization of the map. In the remainder of this article, Section 2 describes 
the methodology and Section 3 illustrates its use on simulated and real-world 
data. 

2 Methodology 

In the following, let us suppose that n input data, $x_1, \dots, x_n$, from an 
arbitrary input space $Q$ are given. These data are described by a dissimilarity 
matrix $D = (\delta_{ij})_{i,j=1,\dots,n}$ such that D is non negative 
($\delta_{ij} \geq 0$), symmetric ($\delta_{ij} = \delta_{ji}$) and null on the 
diagonal ($\delta_{ii} = 0$). The purpose of the algorithm is to map these 
data into a low dimensional grid composed of U units which are linked together 
by a neighborhood relationship $K(u, u')$. A prototype $p_u$ is associated with 
each unit $u \in \{1, \dots, U\}$ in the grid. The U prototypes 
$(p_1, p_2, \dots, p_U)$ are initialized either randomly among the input data 
or as random convex combinations of the input data. 
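
As a minimal sketch of this setup (not the authors' code), the following Python fragment checks the three assumptions made on D and builds the two kinds of initialization mentioned above. Prototypes are stored as a U x n matrix of convex-combination coefficients, which is the representation used for relational SOM later in this section. The function names `check_dissimilarity` and `init_prototypes`, and the toy matrix, are hypothetical.

```python
import numpy as np

def check_dissimilarity(D, tol=1e-12):
    """Return True if D is square, non negative, symmetric and null on the diagonal."""
    D = np.asarray(D, dtype=float)
    return (
        D.ndim == 2
        and D.shape[0] == D.shape[1]
        and bool(np.all(D >= 0))
        and bool(np.allclose(D, D.T, atol=tol))
        and bool(np.allclose(np.diag(D), 0.0, atol=tol))
    )

def init_prototypes(n, U, rng, among_data=False):
    """Return a (U, n) matrix whose rows are convex-combination coefficients."""
    if among_data:
        # Each prototype coincides with one randomly drawn observation.
        beta = np.zeros((U, n))
        beta[np.arange(U), rng.integers(0, n, size=U)] = 1.0
        return beta
    # Random convex combination: positive weights normalized to sum to one.
    beta = rng.random((U, n))
    return beta / beta.sum(axis=1, keepdims=True)

# Toy dissimilarity matrix (illustrative values only).
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
beta = init_prototypes(n=3, U=4, rng=np.random.default_rng(0))
```
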

In the Euclidean framework, where the input space is equipped with a 
distance, the matrix D is the distance matrix with entries 
$\delta_{ij} = \|x_i - x_j\|^2$. In this case, the on-line SOM algorithm iterates 

– an assignment step: a randomly chosen input $x_i$ is assigned to the closest 
prototype, denoted by $p_{f(x_i)}$, according to the shortest distance rule 

$$f(x_i) = \arg\min_{u=1,\dots,U} \|x_i - p_u\|,$$

– a representation step: all prototypes are updated 

$$p_u^{\text{new}} = p_u^{\text{old}} + \alpha K(f(x_i), u) \left(x_i - p_u^{\text{old}}\right),$$

where $\alpha$ is the training parameter. 
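
The two steps of the Euclidean on-line SOM can be sketched as follows. This is an illustrative fragment under stated assumptions, not the authors' implementation: the update is the usual on-line SOM rule that moves each prototype toward the presented input, the neighborhood K is taken to be a simple indicator of grid distance (one possible choice among many), and the name `online_som_step` is hypothetical.

```python
import numpy as np

def online_som_step(x, prototypes, grid, alpha, radius):
    """One iteration of on-line SOM for vector data.

    x          : (d,) input vector
    prototypes : (U, d) current prototypes
    grid       : (U, 2) integer coordinates of the units on the map
    alpha      : training parameter
    radius     : neighborhood radius on the grid (assumed indicator neighborhood)
    """
    # Assignment step: unit whose prototype is closest to x.
    winner = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    # Neighborhood K(winner, u): 1 if u is within `radius` of the winner, else 0.
    grid_dist = np.abs(grid - grid[winner]).max(axis=1)
    K = (grid_dist <= radius).astype(float)
    # Representation step: move neighboring prototypes toward x.
    updated = prototypes + alpha * K[:, None] * (x - prototypes)
    return winner, updated

prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
grid = np.array([[0, 0], [0, 1]])
winner, updated = online_som_step(np.array([1.0, 0.8]), prototypes, grid,
                                  alpha=0.5, radius=0)
```

With a radius of zero only the winning prototype moves, which corresponds to the final stage of training where the neighborhood is restricted to the neuron itself.
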

In the more general framework, where the data are known through pairwise 
distances only, the assignment step cannot be carried out straightforwardly 
since the distances between the input data and the prototypes may not be 
directly computable. The solution introduced in [18] consists in supposing that 
prototypes are convex combinations of the original data, 
$p_u = \sum_{i=1}^{n} \beta_{ui} x_i$ with $\beta_{ui} \geq 0$ and 
$\sum_{i=1}^{n} \beta_{ui} = 1$. If $\beta_u$ denotes the vector 
$(\beta_{u1}, \beta_{u2}, \dots, \beta_{un})$, the distances in the assignment 
step can be written in terms of D and $\beta_u$ only: 

$$\|x_i - p_u\|^2 = \left(\beta_u D\right)_i - \frac{1}{2} \beta_u D \beta_u^T.$$

According to [11], the equation above still holds if the matrix D is no longer 
a distance matrix, but a general dissimilarity matrix, as long as it is symmetric 
and null on the diagonal. A generalization of the batch SOM algorithm, called 
batch relational SOM, which holds for dissimilarity matrices is introduced in 
[11]. 
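
The relational expression of the distance, $(\beta_u D)_i - \frac{1}{2} \beta_u D \beta_u^T$, can be checked numerically in the Euclidean case. The sketch below assumes toy random data and a D made of squared Euclidean distances; it compares the distance computed from an explicit prototype with the one computed from D and the coefficients alone.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # 6 toy points in R^3
# Squared Euclidean distances between all pairs of points.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

beta = rng.random(6)
beta /= beta.sum()                               # convex coefficients of one prototype
p = beta @ X                                     # explicit prototype (needs the vectors)

direct = ((X - p) ** 2).sum(axis=1)              # ||x_i - p||^2 computed from vectors
relational = beta @ D - 0.5 * beta @ D @ beta    # same quantity from D and beta only
```

The two arrays agree up to rounding error, which is why the assignment step never needs the coordinates of the data.
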

The representation step may also be carried out in this general framework 
as long as the prototypes are supposed to be convex combinations of the input 
data. Hence, using the same ideas as [18], we introduce the on-line relational 
SOM, which generalizes the on-line SOM to dissimilarity data. The proposed 
algorithm is the following: 



Algorithm 1 On-line relational SOM 
1: For all $u = 1, \dots, U$ and $i = 1, \dots, n$, initialize $\beta_{ui}^0$ 
   randomly in $\mathbb{R}$, such that $\beta_{ui}^0 \geq 0$ and 
   $\sum_{i=1}^{n} \beta_{ui}^0 = 1$. 
2: for $t = 1, \dots, T$ do 
3:   Randomly choose an input $x_i$ 
4:   Assignment: find the unit of the closest prototype 

     $$f^t(x_i) \leftarrow \arg\min_{u=1,\dots,U} \left(\beta_u^{t-1} D\right)_i - \frac{1}{2} \beta_u^{t-1} D \left(\beta_u^{t-1}\right)^T$$

5:   Update of the prototypes: $\forall u = 1, \dots, U$, 

     $$\beta_u^t \leftarrow \beta_u^{t-1} + \alpha^t K^t\left(f^t(x_i), u\right) \left(\mathbf{1}_i - \beta_u^{t-1}\right)$$

     where $\mathbf{1}_i$ is a vector with a single non null coefficient at the 
     ith position, equal to one. 
6: end for 



In the applications of Section 3, the parameters of the algorithm are chosen 
according to [4]: the neighborhood $K^t$ decreases in a piecewise linear way, 
starting from a neighborhood which corresponds to the whole grid up to a 
neighborhood restricted to the neuron itself; $\alpha^t$ vanishes at the rate 
of $1/t$. Let us remark that if the dissimilarity matrix is a Euclidean 
distance matrix, relational on-line SOM is equivalent to the classical on-line 
SOM algorithm, as long as the n input data contain a basis of the input space 
$Q$. 
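
Algorithm 1, together with the parameter choices just described, can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the neighborhood is assumed to be an indicator of grid distance whose radius shrinks linearly from the whole grid to the winning unit, the training parameter is assumed to be `alpha0 / t`, and the names `online_relational_som` and `assign_all` are hypothetical.

```python
import numpy as np

def online_relational_som(D, grid, T, rng, alpha0=1.0):
    """On-line relational SOM sketch.

    D    : (n, n) symmetric dissimilarity matrix, null on the diagonal
    grid : (U, 2) integer coordinates of the units
    Returns beta, a (U, n) matrix of convex coefficients (one row per prototype).
    """
    n, U = D.shape[0], grid.shape[0]
    beta = rng.random((U, n))
    beta /= beta.sum(axis=1, keepdims=True)          # step 1: random convex init
    # Distance between units on the grid (Chebyshev distance, an assumption).
    grid_dist = np.abs(grid[:, None, :] - grid[None, :, :]).max(axis=2)
    max_radius = grid_dist.max()
    for t in range(1, T + 1):
        i = rng.integers(n)                          # step 3: random input
        # Step 4: relational distances between x_i and every prototype.
        BD = beta @ D
        dist = BD[:, i] - 0.5 * np.einsum("un,un->u", BD, beta)
        winner = int(np.argmin(dist))
        # Assumed schedules: linearly shrinking radius, alpha decreasing as 1/t.
        radius = max_radius * (1.0 - t / T)
        alpha = alpha0 / t
        K = (grid_dist[winner] <= radius).astype(float)
        # Step 5: move the coefficients toward the indicator vector 1_i.
        one_i = np.zeros(n)
        one_i[i] = 1.0
        beta += alpha * K[:, None] * (one_i - beta)
    return beta

def assign_all(D, beta):
    """Assign every observation to the unit of its closest prototype."""
    BD = beta @ D
    quad = 0.5 * np.einsum("un,un->u", BD, beta)
    return (BD - quad[:, None]).argmin(axis=0)

rng = np.random.default_rng(0)
X = rng.random((30, 2))                              # toy data, for illustration only
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
grid = np.array([(r, c) for r in range(3) for c in range(3)])
beta = online_relational_som(D, grid, T=300, rng=rng)
clusters = assign_all(D, beta)
```

Because each update is a convex combination of the previous coefficients and an indicator vector (as long as the step size stays in [0, 1]), the rows of `beta` remain convex coefficients throughout training.
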

As explained in [8], although batch SOM possesses the nice properties of 
being deterministic and of usually converging in a few iterations, it has several 
drawbacks such as bad organization, bad visualization, unbalanced classes and 
strong dependence on the initialization. Moreover, the computational complexity 
of the on-line algorithm may be significantly reduced with respect to the batch 
algorithm. For one iteration, the complexity of the batch algorithm is 
$O(Un^3 + Un^2)$, while for the on-line algorithm it is $O(Un^2 + Un)$. However, 
since the on-line algorithm has to scan all input data, the number of iterations 
is significantly larger than in the batch case. To summarize, if $T_1$ is the 
number of iterations for batch relational SOM and $T_2$ is the number of 
iterations for on-line relational SOM, the ratio between the two computation 
times will be $T_1 n / T_2$. 

For illustration, let us consider 500 points sampled randomly from the uniform 
distribution in $[0, 1]^2$. The batch version of relational SOM and the on-line 
version of relational SOM were performed with identical 10x10 grid structures 
and identical initializations. Results are available in Figure 1. Batch 
relational SOM converged quickly, in 20 iterations (the grid organization is 
represented at iterations 0 (random initialization), 5, 9, 13, 17 and 20), but 
the map is not well organized. On-line relational SOM converged in fewer than 
2500 iterations (the grid organization is represented at iterations 0 
(initialization), 500, 1000, 1500, 2000 and 2500), but the map is now almost 
perfectly organized. These results were obtained in 40 minutes for the batch 
version and in 10 minutes for the on-line version on a netpc (with 2 x 1 GHz 
AMD processors and 4 GB of RAM). 
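
As a rough consistency check, the ratio $T_1 n / T_2$ between the two computation times can be evaluated with the figures of this experiment, taking n = 500, $T_1 = 20$ and $T_2 = 2500$ (an upper bound, since the on-line version converged in fewer than 2500 iterations):

```latex
\frac{\text{batch time}}{\text{on-line time}}
  \approx \frac{T_1\, n}{T_2}
  = \frac{20 \times 500}{2500}
  = 4,
\qquad
\frac{40\ \text{min}}{10\ \text{min}} = 4 .
```

The agreement with the measured timings is only indicative, because the ratio ignores constant factors and implementation details.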



3 Applications 



This section presents several applications of the on-line relational SOM on 
various datasets. Section 3.1 deals with simulated data described by numerical 
variables, but organized on a non linear surface. Section 3.2 is an application 
on a real dataset where the individuals are described by categorical variables. 
Finally, Section 3.3 is an application to the clustering of nodes of a graph. 



3.1 Swiss roll 

Let us first use a toy example to illustrate the stochastic version of relational 
SOM. The simulated data is the popular Swiss roll, a two-dimensional manifold 
embedded in a three-dimensional space. This example has already been used for 
illustrating the performance of Isomap [20]. The data has the shape illustrated 
by Figure 2. 5 000 points were simulated. However, since all methods presented 
here work with matrices of pairwise distances, the computation times would have 
been rather heavy for 5 000 points. Hence, we ran the different algorithms on 
1 000 points uniformly distributed on the manifold. First, the distance matrix 
was computed using the geodesic distance based on the K-rule with K = 10. 
Then, two types of algorithms were performed: multidimensional scaling and 
self-organizing maps. The results obtained with Isomap [20] are available in 
Figure 2. As expected, both methods succeed in unfolding the Swiss roll and the 
results are very similar. Next, batch median SOM and on-line relational SOM 
were applied to the dissimilarity matrix computed with the geodesic distance. 
As shown in Figure 3, the size of the map plays an important role in unfolding 
the data. For square grids, the problem is not completely solved by either of 
the two algorithms. Nevertheless, on-line relational SOM manages to project 
the different scrolls of the roll into separate regions on the map. Moreover, some 
empty cells highlight the roll structure, which is not completely unfolded but 
rather projected without overlapping. Since square grids appeared too heavily 
constrained, we also tested rectangular grids. The results are better for both 
algorithms, which both manage to unfold the data. However, the on-line version 
clearly outperforms the batch version. 

Fig. 1. Batch (20 iterations) and on-line (2500 iterations) relational SOM 
organization for 500 samples from the uniform distribution in $[0, 1]^2$. The 
same initialization was used for both algorithms. 
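
The geodesic distance based on the K-rule can be sketched as follows, assuming the usual Isomap construction: each point is linked to its K nearest neighbors, and the geodesic distance is the length of the shortest path in the resulting graph. The function name `geodesic_distances` is hypothetical, the shortest paths are computed with a plain Floyd-Warshall loop (cubic in n, so only suitable for small samples), and the example uses a half circle instead of the Swiss roll to keep it short.

```python
import numpy as np

def geodesic_distances(X, K):
    """Approximate geodesic distances from a K-nearest-neighbor graph."""
    n = X.shape[0]
    eucl = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    # Column 0 of the argsort is the point itself (distance 0, assuming
    # distinct points), so the K nearest neighbors are columns 1..K.
    nearest = np.argsort(eucl, axis=1)[:, 1:K + 1]
    rows = np.repeat(np.arange(n), K)
    cols = nearest.ravel()
    G[rows, cols] = eucl[rows, cols]
    G = np.minimum(G, G.T)                 # make the graph undirected
    # Floyd-Warshall: allow paths through each intermediate point k in turn.
    for k in range(n):
        G = np.minimum(G, G[:, k, None] + G[None, k, :])
    return G

# Toy example: 20 points on a half circle of radius 1.
theta = np.linspace(0.0, np.pi, 20)
X = np.column_stack([np.cos(theta), np.sin(theta)])
G = geodesic_distances(X, K=2)
```

In this toy case the straight-line distance between the two end points is 2, whereas `G[0, -1]` follows the curve and is slightly below pi, the length of the arc.
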




Fig. 2. Unfolding the Swiss roll using Isomap 



a) 15x15 grid, batch median SOM 
b) 15x15 grid, on-line relational SOM 
c) 30x10 grid, batch median SOM 
d) 30x10 grid, on-line relational SOM 

Fig. 3. Unfolding the Swiss roll using self-organizing maps 



3.2 Amazonian butterflies 



This data set contains 465 input data and was previously used by [13] to 
demonstrate the synergy between DNA barcoding and morphological-diversity 
studies. The notion of DNA barcoding comprises a wide family of molecular and 
bioinformatics methods aimed at identifying biological specimens and assigning 
them to a species. According to the vast literature published during the past 
years on the topic, two separate tasks emerge for DNA barcoding: on the one 
hand, assigning unknown samples to known species and, on the other hand, 
discovering undescribed species [?]. The second task is usually approached with 
the Neighbor Joining algorithm [19], which constructs a tree similar to a 
dendrogram. When the sample size is large, the trees rapidly become unreadable. 
Moreover, they are quite sensitive to the order in which the input data are 
presented. Let us also mention that unsupervised learning and visualization 
methods are used to a very limited extent by the DNA barcoding community, 
although the information they bring may be quite useful. The use of 
self-organizing maps may be quite helpful in visualizing the data and bringing 
out clusters or groups of clusters that may correspond to undescribed species. 

DNA barcoding data are composed of sequences of nucleotides, i.e. sequences 
of "a", "c", "g", "t" letters in high dimension (hundreds or thousands of sites). 
Specific distances and dissimilarities such as the Kimura-2P [14] are usually 
computed. Hence, since the data is not Euclidean, dissimilarity-based methods 
appear to be more appropriate. Recently, batch median SOM was tested in [17] 
on several data sets, amongst which the Amazonian butterflies. Although median 
SOM provided encouraging results, two main drawbacks emerged. First, since the 
algorithm was run in batch, the organization of the map was generally poor and 
highly dependent on the initialization. Second, since the algorithm calculates 
a prototype for each cluster among the dataset, it does not allow for empty 
clusters. Thus, the existence of species or groups of species was difficult to 
acknowledge. The use of on-line relational SOM overcomes these two issues. As 
shown in Figure 4, clusters are generally not mixing species, while the empty 
cells allow detecting the main groups of species. The only mixing class 
corresponds to a labeling error. Unsupervised clustering may thus be useful in 
addressing misidentification issues. In Figure 4 (b), distances with respect 
to the nearest neighbors were computed for each node. The distance between 
two nodes/cells is computed as the mean dissimilarity between the observations 
within each class. A polygon is drawn within each cell with vertices proportional 
to the distances to its neighbors. If two neighbor prototypes are very close, then 
the corresponding vertices are very close to the edges of the two cells. If the 
distance between neighbor prototypes is very large, then the corresponding 
vertices are far apart, close to the center of the cells. 
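
The distance between two cells used for this display can be sketched as below. The fragment assumes one reading of the description above, namely the mean of the dissimilarities over all pairs made of one observation from each cell; the function name `between_cell_distance` and the toy matrix are hypothetical.

```python
import numpy as np

def between_cell_distance(D, clusters, a, b):
    """Mean dissimilarity between the observations of cell a and those of cell b."""
    ia = np.flatnonzero(clusters == a)
    ib = np.flatnonzero(clusters == b)
    if ia.size == 0 or ib.size == 0:
        return float("nan")                # at least one of the cells is empty
    # Sub-matrix of D with rows in cell a and columns in cell b.
    return float(D[np.ix_(ia, ib)].mean())

# Toy dissimilarity matrix: observations 0-1 are in cell 0, observations 2-3 in cell 1.
D = np.array([[0.0, 1.0, 5.0, 6.0],
              [1.0, 0.0, 4.0, 5.0],
              [5.0, 4.0, 0.0, 1.0],
              [6.0, 5.0, 1.0, 0.0]])
clusters = np.array([0, 0, 1, 1])
d01 = between_cell_distance(D, clusters, 0, 1)   # mean of 5, 6, 4, 5
```

Empty cells, which on-line relational SOM allows, have no observations to average over, so the sketch returns NaN for them.
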

3.3 Political books 

This application uses a dataset modeled by a graph having 105 nodes. The nodes 
are books about US politics published around the time of the 2004 presidential 
election and sold by the on-line bookseller Amazon.com. Edges between two 
nodes represent frequent co-purchasing of the two books by the same buyers. 
The graph contains 441 edges and all nodes are labeled according to their polit- 
ical orientation (conservative, liberal or neutral). The graph has been extracted 
by Valdis Krebs and can be downloaded at 
http://www-personal.umich.edu/~mejn/netdata/polbooks.zip. 

a) Species diversity (radius proportional to the size of the cluster) 
b) Distances between prototypes 

Fig. 4. On-line relational SOM for Amazonian butterflies 



On-line relational SOM was used to cluster the nodes of the graph, according 
to the length of the shortest path between two nodes, which is a standard 
dissimilarity measure between nodes in a graph. Figures 5 and 6 (left) provide 
two representations of the "political books" network. The first one is the 
original graph displayed with the force-directed placement algorithm described 
in [9], with nodes colored according to the clusters in which they are 
classified. The second one is a simplified representation of the graph on the 
grid, where each node represents a cluster. The colors in the first figure and 
the density of edges in the second one show that the clustering has a good 
organization on the grid, according to the graph structure: groups of nodes 
that are densely connected are classified in the same or in close clusters, 
whereas groups of nodes that are not connected are classified apart. 
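
The shortest-path dissimilarity between the nodes of an unweighted graph can be computed with one breadth-first search per node, as in the following sketch. The function name `shortest_path_lengths` and the small edge list are hypothetical; the actual network would be read from the downloaded file.

```python
from collections import deque

def shortest_path_lengths(n, edges):
    """Matrix of shortest-path lengths (number of edges) in an undirected graph.

    Unreachable pairs keep the value float('inf').
    """
    adjacency = [[] for _ in range(n)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    D = [[float("inf")] * n for _ in range(n)]
    for source in range(n):
        D[source][source] = 0
        queue = deque([source])
        while queue:                       # breadth-first search from `source`
            v = queue.popleft()
            for w in adjacency[v]:
                if D[source][w] == float("inf"):
                    D[source][w] = D[source][v] + 1
                    queue.append(w)
    return D

# Toy graph: a path 0-1-2-3 with an extra edge between nodes 0 and 2.
D = shortest_path_lengths(4, [(0, 1), (1, 2), (2, 3), (0, 2)])
```

The resulting matrix is symmetric, non negative and null on the diagonal, so it satisfies the assumptions made on D in Section 2 as long as the graph is connected.
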

Additionally, Figure 6 provides the distribution of the node labels inside each 
cluster for the obtained clustering (on the right hand part of the figure). Almost 
all clusters contain books having the same political orientation. Clusters that 
contain books with multiple political orientations are in the middle of the grid 
and include neutral books. Hence, this clustering can give a clue on a more subtle 
political orientation than the original labeling: for instance, liberal books from 
cluster 12 probably have a weaker commitment than those from clusters 1 or 2. 
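
The distribution of labels inside each cluster can be tabulated as in the following sketch. The function names `label_distribution` and `mixed_clusters` are hypothetical, and the cluster numbers and labels below are made-up toy values, not the results of the paper.

```python
from collections import Counter, defaultdict

def label_distribution(clusters, labels):
    """Count, for every cluster, how many nodes carry each label."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return {cluster: dict(counter) for cluster, counter in counts.items()}

def mixed_clusters(distribution):
    """Clusters that contain more than one label."""
    return sorted(c for c, counter in distribution.items() if len(counter) > 1)

# Toy assignment of six books to clusters (illustrative values only).
clusters = [1, 1, 2, 12, 12, 12]
labels = ["liberal", "liberal", "liberal", "liberal", "neutral", "conservative"]
distribution = label_distribution(clusters, labels)
mixed = mixed_clusters(distribution)
```
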



4 Conclusion 



An on-line version of relational SOM is introduced in this paper. It combines the 
standard advantages of the stochastic version of the SOM (better organization 
and faster computation) with the relational SOM that is able to handle data 
described by a dissimilarity. The algorithm shows good performance in projecting 
data described either by numerical variables or by categorical variables, as well 
as in clustering the nodes of a graph. 

Fig. 5. "Political books" network displayed with a force-directed placement algorithm. 
The nodes are labeled according to their political orientation and are colored according 
to a gradient that aims at emphasizing the distance between clusters on the grid, as 
represented at the top of the figure. 




Fig. 6. Left: Simplified representation of the graph on the grid: each node represents a 
cluster whose area is proportional to the number of nodes included in it, and the edge 
width represents the number of edges between the nodes of the corresponding clusters. 
Right: Distribution of the node labels for each neuron of the grid for the clustering 
obtained with the dissimilarity based on the length of the shortest paths. Red is for 
liberal books, blue for conservative books and green for neutral books. 